# initialise a list of required packages
packages = c('sf', 'tidyverse', 'tmap', 'spdep', 'httr', 'ggmap', "rvest",
'onemapsgapi', 'units', 'matrixStats', 'readxl', 'jsonlite',
'olsrr', 'corrplot', 'ggpubr', 'GWmodel',
'devtools', 'kableExtra', 'plotly', 'ggthemes')
# for each package, check if installed and if not, install it
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}Take-Home Excercise 03
1.0 Overview
1.1 Background
Housing and Development Board (HDB) flats are a crucial aspect of the Singaporean housing market. They serve as the primary form of public housing for over 80% of the population, providing affordable and accessible homes for citizens. Due to the high demand for public housing, the pricing of HDB flats is a crucial issue that affects not just the housing market but also the wider economy and society.
The pricing of HDB flats in Singapore is determined by a range of factors, including location, age, size, and condition of the flat, as well as market demand and supply. The government plays a crucial role in setting the pricing policies for HDB flats, which have significant implications for homeowners, potential buyers, and the broader economy.
Investigating and explaining the factors that affect the resale prices of public housing in Singapore is essential for several reasons. Firstly, understanding these factors can provide valuable insights into the housing market’s dynamics, helping policymakers and stakeholders make informed decisions. Secondly, resale prices can significantly impact homeowners’ financial well-being, so it is crucial to understand the factors that contribute to these prices. Thirdly, the study of these factors can help homeowners make informed decisions about when to sell their homes and at what price. Finally, understanding the factors that affect resale prices can help identify potential areas for intervention or policy changes to ensure the stability and affordability of public housing in Singapore. Overall, investigating and explaining the factors affecting resale prices of public housing in Singapore is an important area of research with significant implications for homeowners, policymakers, and the wider society.
1.2 Task
In this take-home exercise, we are tasked to predict HDB resale prices at the sub-market level (i.e. HDB 3-room, HDB 4-room and HDB 5-room) for the month of January and February 2023 in Singapore. The predictive models must be built by using by using conventional OLS method and GWR methods. You are also required to compare the performance of the conventional OLS method versus the geographical weighted methods.
1.3 Packages Used
sf :
tidyverse :
tmap :
spdep :
onemapsapi :
httr :
ggmap :
rvest : for html_text()
units : ???
matrixStats : ???
jsonlite : ??
olsrr : ??
coorplot : ??
GWmodel :
devtools :
kableExtra :
plotly :
ggthemes :
# reference for manipulating output messages: https://yihui.org/knitr/demo/output/
devtools::install_github("gadenbuie/xaringanExtra")
library(xaringanExtra)xaringanExtra::use_panelset()1.4 Datasets Used
| Type | Name | Format | Source |
|---|---|---|---|
| Aspatial | Resale Flat Prices | .csv | [data.gov.sg](https://data.gov.sg/dataset/resale-flat-prices) |
| Geospatial | MPSZ-2019 | .shp | From Prof. Kam's In-Class Excercise 09 |
| Geospatial | Eldercare Services | .shp | [data.gov.sg](https://data.gov.sg/dataset/) |
| Geospatial | Hawker Centres | .geojson | [data.gov.sg](https://data.gov.sg/dataset/) |
| Geospatial | Gyms | .geojson | [data.gov.sg](https://data.gov.sg/dataset/) |
| Geospatial | General Information of Schools | .csv | [data.gov.sg](https://data.gov.sg/dataset/) |
| Geospatial | Parks | .kml | [data.gov.sg](https://data.gov.sg/dataset/) |
| Geospatial | MRT Locations | .geojson | [data.gov.sg](https://data.gov.sg/dataset/) |
| Geospatial | Kindergartens | .shp | [OneMap API](https://www.onemap.gov.sg/docs/) |
| Geospatial | Pre-School Locations | .shp | [OneMap API](https://www.onemap.gov.sg/docs/) |
| Geospatial | Private Education Institutes | .shp | [OneMap API](https://www.onemap.gov.sg/docs/) |
| Geospatial | Supermarkets | .shp | [OneMap API](https://www.onemap.gov.sg/docs/) |
| Geospatial | Childcare Services | .shp | [OneMap API](https://www.onemap.gov.sg/docs/) |
| Geospatial | Dengue Clusters | .shp | [OneMap API](https://www.onemap.gov.sg/docs/) |
| Geospatial | Bus Stop Locations | .shp | [LTA Data Mall](https://datamall.lta.gov.sg/content/datamall/en/search_datasets.html?searchText=bus%20stop) |
| Geospatial | Shopping Malls | .csv | Wikipedia |
We are considering the following factors to determine the resale price of HDB
Floor Level
Remaining Lease Period
Area of the Unit
Age of Unit
Storey-Floor
Proximity to CBD
Proximity to eldercare
Proximity to foodcourt/hawker centers
Proximity to MRT
Proximity to park
Proximity to good primary school
Proximity to shopping mall
Proximity to supermarket
Number of Kindgartens within 350m
Number of Childcare services within 350m
Number of Bus stops within 350m
Number of Primary Schools within 1km
2.0 Importing and Wrangling of Aspatial Data
We use the read_csv() function of readr package to import resale-flat-prices into R as a tibble data frame called resale. Further, glimpse() function of dplyr package is used to display the data structure.
resale <- read_csv("data/aspatial/resale-flat-prices.csv")glimpse(resale)Rows: 148,864
Columns: 11
$ month <chr> "2017-01", "2017-01", "2017-01", "2017-01", "2017-…
$ town <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
$ flat_type <chr> "2 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", …
$ block <chr> "406", "108", "602", "465", "601", "150", "447", "…
$ street_name <chr> "ANG MO KIO AVE 10", "ANG MO KIO AVE 4", "ANG MO K…
$ storey_range <chr> "10 TO 12", "01 TO 03", "01 TO 03", "04 TO 06", "0…
$ floor_area_sqm <dbl> 44, 67, 67, 68, 67, 68, 68, 67, 68, 67, 68, 67, 67…
$ flat_model <chr> "Improved", "New Generation", "New Generation", "N…
$ lease_commence_date <dbl> 1979, 1978, 1980, 1980, 1980, 1981, 1979, 1976, 19…
$ remaining_lease <chr> "61 years 04 months", "60 years 07 months", "62 ye…
$ resale_price <dbl> 232000, 250000, 262000, 265000, 265000, 275000, 28…
We can see the following information upon running the glimpse function -
The dataset contains 11 columns with 148,864 rows
The columns present are the following -
month (month here is in the format of yyyy/mm)
town
flat_type
block
street_name
storey_range
floor_area_sqm
flat_model
lease_commence_date
remaining_lease
resale_price
The data is from Jan 2017 and consists of all flat types including Executive to 2,3,4 and 5 bedrooms. However, we will only be focusing only 4 bedroom and we need data from 1st January 2021 on wards till 31st December. Hence, we will be filtering the data
2.1 Filtering Resale Data
We will be using the filter() function of dplyr to select our flat_types and dates in rs_subset. We will also be using the unique() function to check if we have successfully extracted the flat_type and month.
rs_subset <- filter(resale,flat_type == "4 ROOM") %>%
filter(month >= "2021-01" & month <= "2022-12")glimpse(rs_subset)Rows: 23,656
Columns: 11
$ month <chr> "2021-01", "2021-01", "2021-01", "2021-01", "2021-…
$ town <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
$ flat_type <chr> "4 ROOM", "4 ROOM", "4 ROOM", "4 ROOM", "4 ROOM", …
$ block <chr> "547", "414", "509", "467", "571", "134", "204", "…
$ street_name <chr> "ANG MO KIO AVE 10", "ANG MO KIO AVE 10", "ANG MO …
$ storey_range <chr> "04 TO 06", "01 TO 03", "01 TO 03", "07 TO 09", "0…
$ floor_area_sqm <dbl> 92, 92, 91, 92, 92, 98, 92, 92, 92, 92, 92, 109, 9…
$ flat_model <chr> "New Generation", "New Generation", "New Generatio…
$ lease_commence_date <dbl> 1981, 1979, 1980, 1979, 1979, 1978, 1977, 1978, 19…
$ remaining_lease <chr> "59 years", "57 years 09 months", "58 years 06 mon…
$ resale_price <dbl> 370000, 375000, 380000, 385000, 410000, 410000, 41…
unique(rs_subset$month) [1] "2021-01" "2021-02" "2021-03" "2021-04" "2021-05" "2021-06" "2021-07"
[8] "2021-08" "2021-09" "2021-10" "2021-11" "2021-12" "2022-01" "2022-02"
[15] "2022-03" "2022-04" "2022-05" "2022-06" "2022-07" "2022-08" "2022-09"
[22] "2022-10" "2022-11" "2022-12"
unique(rs_subset$flat_type)[1] "4 ROOM"
From the above results we can see that there are 23,656 transactions for 4 Bedroom flats from 1st January 2021 to 31st December 2022.
2.2 Transforming Resale Data Columns
We will be using the mutate() function to create a new variable called rs_transform with the following columns -
address : concatenation of the block and street_name columns using
paste()function of base R packageremaining_lease_yr&remaining_lease_mth: we will split the year and months part of theremaining_leaserespectively usingstr_sub()function of stringr package then converting the character to integer usingas.integer()function of base R package
rs_transform <- rs_subset %>%
mutate(rs_subset, address = paste(block,street_name)) %>%
mutate(rs_subset, remaining_lease_yr = as.integer(str_sub(remaining_lease, 0, 2))) %>%
mutate(rs_subset, remaining_lease_mth = as.integer(str_sub(remaining_lease, 9, 11)))head(rs_transform)# A tibble: 6 × 14
month town flat_…¹ block stree…² store…³ floor…⁴ flat_…⁵ lease…⁶ remai…⁷
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr>
1 2021-01 ANG MO … 4 ROOM 547 ANG MO… 04 TO … 92 New Ge… 1981 59 yea…
2 2021-01 ANG MO … 4 ROOM 414 ANG MO… 01 TO … 92 New Ge… 1979 57 yea…
3 2021-01 ANG MO … 4 ROOM 509 ANG MO… 01 TO … 91 New Ge… 1980 58 yea…
4 2021-01 ANG MO … 4 ROOM 467 ANG MO… 07 TO … 92 New Ge… 1979 57 yea…
5 2021-01 ANG MO … 4 ROOM 571 ANG MO… 07 TO … 92 New Ge… 1979 57 yea…
6 2021-01 ANG MO … 4 ROOM 134 ANG MO… 07 TO … 98 New Ge… 1978 56 yea…
# … with 4 more variables: resale_price <dbl>, address <chr>,
# remaining_lease_yr <int>, remaining_lease_mth <int>, and abbreviated
# variable names ¹flat_type, ²street_name, ³storey_range, ⁴floor_area_sqm,
# ⁵flat_model, ⁶lease_commence_date, ⁷remaining_lease
2.3 Sum up remaining lease in months
There are some values in remaining_lease_mth which are NA. We will be first converting this NA values into 0 with the help of is.na() function. Upon doing this, we will convert the remaining_lease_yr into months by multiplying it by 12. We will replace these 2 columns by summing both these columns to create the remaining_lease_mths using rowSums() and mutate() functions. This columns will show the total remaining lease period in months.
rs_transform$remaining_lease_mth[is.na(rs_transform$remaining_lease_mth)] <- 0
rs_transform$remaining_lease_yr <- rs_transform$remaining_lease_yr * 12
rs_transform <- rs_transform %>%
mutate(rs_transform, remaining_lease_mths = rowSums(rs_transform[, c("remaining_lease_yr", "remaining_lease_mth")])) %>%
select(month, town, address, block, street_name, flat_type, storey_range, floor_area_sqm, flat_model,
lease_commence_date, remaining_lease_mths, resale_price)2.4 Retrieval of Address
We will be retrieving data such as postal codes and coordinates of the address which will be essential in finding the proximity to locational factors.
2.4.1 Creating a list storing unique addresses
We will be using the unique() function of the base R package to extract unique addresses and then using the sort() function of base R package to sort the unique vectors. Further, we will be storing it in a list, so that we do not call the GET request more than required.
add_list <- sort(unique(rs_transform$address))2.4.2 Creating a function to retrieve coordinates from OneMap.sg API
We will be using the GET() function of httr package to make a GET request to https://developers.onemap.sg/commonapi/search. This allows us to query spatial data in a tidy format. The retrieved coordinates will be then be stored in a dataframe create called postal_coords. Further, we need to take note of the following variaboles which will be used in our GET() request.
searchVal: Keywords entered by user that is used to filter out the results.returnGeom{Y/N}: Checks if user wants to return the geometry.getAddrDetails{Y/N}: Checks if user wants to return address details for a point.
A thing to note is that the returned JSON response will contain multiple values, however, we are only interested in the postal code and coordinates like Longitude and Latitude. Further, we will create a dataframe new_row to store each final set of coordinates retrieved during the loop. We are creating this new dataframe so that we can check the number of responses returned and append to the postal_coords (using r_bind()) as some of the locations might have a single result of postal while the others might have multiple set of postal codes. Further, we also need to check if the address is invalid by looking at the number of rows returned (i.e. found = 0).
get_coords <- function(add_list){
# Create a data frame to store all retrieved coordinates
postal_coords <- data.frame()
for (i in add_list){
#print(i)
r <- GET('https://developers.onemap.sg/commonapi/search?',
query=list(searchVal=i,
returnGeom='Y',
getAddrDetails='Y'))
data <- fromJSON(rawToChar(r$content))
found <- data$found
res <- data$results
# Create a new data frame for each address
new_row <- data.frame()
# If single result, append
if (found == 1){
postal <- res$POSTAL
lat <- res$LATITUDE
lng <- res$LONGITUDE
new_row <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
}
# If multiple results, drop NIL and append top 1
else if (found > 1){
# Remove those with NIL as postal
res_sub <- res[res$POSTAL != "NIL", ]
# Set as NA first if no Postal
if (nrow(res_sub) == 0) {
new_row <- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
}
else{
top1 <- head(res_sub, n = 1)
postal <- top1$POSTAL
lat <- top1$LATITUDE
lng <- top1$LONGITUDE
new_row <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
}
}
else {
new_row <- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
}
# Add the row
postal_coords <- rbind(postal_coords, new_row)
}
return(postal_coords)
}2.4.3 Retrieve Resale Coordinates
coords <- get_coords(add_list)2.4.3 Check Results
We will be using the is.na() function of base R package to chekc if any of the relevant columns contain any NA values.
coords[(is.na(coords$postal) | is.na(coords$latitude) | is.na(coords$longitude) | coords$postal=="NIL"), ] address postal latitude longitude
1305 215 CHOA CHU KANG CTRL NIL 1.38308302434129 103.747076627693
From the above message, we can see that the postal code of 215 CHOA CHU KANG CTRL is missing. Upon searching it up online, we can see that the postal code is 680215.
2.4.4 Combine Resale and Coordinates Data
We will now use the left_join() function of dplyr package combine the successfully retrieved coordinates with out transformed resale dataset.
rs_coords <- left_join(rs_transform, coords, by = c('address' = 'address'))head(rs_coords)# A tibble: 6 × 15
month town address block stree…¹ flat_…² store…³ floor…⁴ flat_…⁵ lease…⁶
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
1 2021-01 ANG MO … 547 AN… 547 ANG MO… 4 ROOM 04 TO … 92 New Ge… 1981
2 2021-01 ANG MO … 414 AN… 414 ANG MO… 4 ROOM 01 TO … 92 New Ge… 1979
3 2021-01 ANG MO … 509 AN… 509 ANG MO… 4 ROOM 01 TO … 91 New Ge… 1980
4 2021-01 ANG MO … 467 AN… 467 ANG MO… 4 ROOM 07 TO … 92 New Ge… 1979
5 2021-01 ANG MO … 571 AN… 571 ANG MO… 4 ROOM 07 TO … 92 New Ge… 1979
6 2021-01 ANG MO … 134 AN… 134 ANG MO… 4 ROOM 07 TO … 98 New Ge… 1978
# … with 5 more variables: remaining_lease_mths <dbl>, resale_price <dbl>,
# postal <chr>, latitude <chr>, longitude <chr>, and abbreviated variable
# names ¹street_name, ²flat_type, ³storey_range, ⁴floor_area_sqm,
# ⁵flat_model, ⁶lease_commence_date
2.4.5 Handling NIL data
We now need to add the postal code of 215 CHOA CHU KANG CTRL. Since we can see from the below code, the postal code is not in integer, but in character, we will be replacing the NIL with the postal code in character type.
typeof(rs_coords$postal)[1] "character"
rs_coords[rs_coords$address == '215 CHOA CHU KANG CTRL', "postal"] <- "680215"Lets verify that the postal code has been replaced and there are no more NA or NIL values.
rs_coords[(is.na(rs_coords$postal) | is.na(rs_coords$latitude) | is.na(rs_coords$longitude) | rs_coords$postal=="NIL"), ]# A tibble: 0 × 15
# … with 15 variables: month <chr>, town <chr>, address <chr>, block <chr>,
# street_name <chr>, flat_type <chr>, storey_range <chr>,
# floor_area_sqm <dbl>, flat_model <chr>, lease_commence_date <dbl>,
# remaining_lease_mths <dbl>, resale_price <dbl>, postal <chr>,
# latitude <chr>, longitude <chr>
2.4.6 Write file to RDS
Since, our subset resale dataset is now complete with the coordinates, we can save it into a rds file to prevent running the GET() function multiple times.
rs_coords_rds <- write_rds(rs_coords, "data/rds/rs_coords.rds")Now lets read the RDS file to verify its saved properly.
rs_coords <- read_rds("data/rds/rs_coords.rds")glimpse(rs_coords)Rows: 23,656
Columns: 15
$ month <chr> "2021-01", "2021-01", "2021-01", "2021-01", "2021…
$ town <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO…
$ address <chr> "547 ANG MO KIO AVE 10", "414 ANG MO KIO AVE 10",…
$ block <chr> "547", "414", "509", "467", "571", "134", "204", …
$ street_name <chr> "ANG MO KIO AVE 10", "ANG MO KIO AVE 10", "ANG MO…
$ flat_type <chr> "4 ROOM", "4 ROOM", "4 ROOM", "4 ROOM", "4 ROOM",…
$ storey_range <chr> "04 TO 06", "01 TO 03", "01 TO 03", "07 TO 09", "…
$ floor_area_sqm <dbl> 92, 92, 91, 92, 92, 98, 92, 92, 92, 92, 92, 109, …
$ flat_model <chr> "New Generation", "New Generation", "New Generati…
$ lease_commence_date <dbl> 1981, 1979, 1980, 1979, 1979, 1978, 1977, 1978, 1…
$ remaining_lease_mths <dbl> 708, 693, 702, 695, 689, 681, 661, 682, 692, 692,…
$ resale_price <dbl> 370000, 375000, 380000, 385000, 410000, 410000, 4…
$ postal <chr> "560547", "560414", "560509", "560467", "560571",…
$ latitude <chr> "1.37420951743562", "1.36390466431674", "1.374000…
$ longitude <chr> "103.858209667888", "103.853913839503", "103.8501…
2.4.7 Assign and Transform CRS
Since we are using Longitudes and Latitudes which are in decimals, the CRS will be WGS84. Hence, we will need to assign them first to EPSG code 4326 and then transform it to 3414 which is the EPSG code for SVY21 (Singapore).
rs_coords_sf <- st_as_sf(rs_coords,
coords = c("longitude",
"latitude"),
crs=4326) %>%
st_transform(crs = 3414)Now lets check that the CRS has been successfully transformed
st_crs(rs_coords_sf)Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
2.4.8 Checking for Invalid Geometries
length(which(st_is_valid(rs_coords_sf) == FALSE))[1] 0
2.4.9 Plotting HDB Resale Points
tmap_mode("view")
tm_shape(rs_coords_sf)+
tm_dots(col="blue", size = 0.02)